Tools: Python, Pandas, Altair, Jupyter Notebook
Dataset: Netflix Movies & TV Shows (Kaggle)
This project analyzes the Netflix Titles dataset through data cleaning, feature engineering, and interactive visualizations. The goal is to understand how Netflix's catalog has evolved over time, how genres and ratings are distributed, which countries produce the most content, and how long it takes titles to appear on Netflix after their original release.
The notebook includes:
All visualizations use Altair and are fully interactive when run locally or viewed through NBViewer.
Task & Goal Identification:
The project focuses on questions such as:
Overall, the project provides a comprehensive and interactive look at how Netflix’s content library is structured and how it has evolved, while demonstrating thoughtful design decisions and data-cleaning practices.
import pandas as pd
import altair as alt
# Global chart sizing
CHART_WIDTH = 750
CHART_HEIGHT = 350
alt.themes.enable('default')
alt.data_transformers.disable_max_rows() # allow > 5k rows if needed
The original dataset contains 8,807 rows and was sampled down to 3,000 rows to meet performance recommendations for Altair visualizations.
In this section, we load the full dataset to create the subset.
df_full = pd.read_csv("data/netflix_titles.csv")
df_full.shape, df_full.head()
# Quick data overview
df_full.info()
df_full.isna().sum()
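Altair embeds the data in every chart specification and, by default, raises an error for datasets with more than 5,000 rows, so it performs best with smaller datasets.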
We generate a random sample of 3,000 rows using a fixed random seed for reproducibility.
# A fixed random seed (here 42) makes the sample identical on every run, which ensures reproducibility
df_subset = df_full.sample(n=3000, random_state=42)
# Save subset inside the repo's data folder
df_subset.to_csv("data/netflix_titles_subset_3000.csv", index=False)
df_subset.head()
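A quick way to check that a seeded sample is reproducible and roughly representative is sketched below. The toy table and its 70/30 type split are invented for illustration and are not taken from the Netflix data.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset: 100 rows, 70% Movie / 30% TV Show
toy_full = pd.DataFrame({
    "show_id": [f"s{i}" for i in range(1, 101)],
    "type": ["Movie"] * 70 + ["TV Show"] * 30,
})

# The same seed returns the same rows in the same order
sample_a = toy_full.sample(n=30, random_state=42)
sample_b = toy_full.sample(n=30, random_state=42)
same_rows = sample_a.index.equals(sample_b.index)

# Compare the type mix of the sample with the type mix of the full table
mix = pd.DataFrame({
    "full": toy_full["type"].value_counts(normalize=True),
    "sample": sample_a["type"].value_counts(normalize=True),
})

print(same_rows)
print(mix.round(2))
```

Running the same comparison on df_full and df_subset would show how closely the 3,000-row sample tracks the full catalog.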
In order to prepare the Netflix dataset for effective analysis and visualization, several cleaning and transformation steps were applied. These steps help simplify the dataset, standardize formats, and create new variables that better support visual exploration.
To prepare the dataset for visualization, we:
Convert date_added to datetime
Extract year_added
Split duration into numeric and type fields
These steps simplify the dataset and make it easier to visualize trends.
📌 1. Handling Whitespace and Text Formatting
Many text fields (e.g., title, director, cast, country, listed_in) contained trailing or leading whitespace. To ensure clean categorical grouping and avoid mismatched categories, whitespace was removed from all string-based columns.
Reason:
Ensures consistent grouping
Prevents categories like "Drama" and "Drama " from being treated as different
df = df_subset.copy()
# Strip whitespace
str_cols = df.select_dtypes(include="object").columns
for col in str_cols:
    df[col] = df[col].str.strip()
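The effect of stray whitespace on grouping can be seen in a small sketch. The four values below are made up for illustration.

```python
import pandas as pd

# Hypothetical values: two genres, each written with and without stray spaces
toy = pd.DataFrame({"listed_in": ["Drama", "Drama ", " Comedy", "Comedy"]})

# Before cleaning, each spelling counts as its own category
before = toy["listed_in"].value_counts()

# After stripping, the variants collapse into two categories
toy["listed_in"] = toy["listed_in"].str.strip()
after = toy["listed_in"].value_counts()

print(len(before), len(after))
```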
📌 2. Parsing date_added and Extracting year_added
The raw date_added column contains full dates (e.g., "September 9, 2019"). This was converted to a datetime object and a new year_added field was extracted.
# Convert date_added
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
df["year_added"] = df["date_added"].dt.year.astype("Int64")
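As a minimal sketch of how errors="coerce" and the nullable Int64 type behave together, the toy column below mixes valid dates with a missing value and an unparseable string. All four values are invented.

```python
import pandas as pd

# Hypothetical inputs: two valid dates, one missing value, one unparseable string
toy = pd.DataFrame({
    "date_added": ["September 9, 2019", "January 1, 2020", None, "not a date"]
})

# Unparseable or missing entries become NaT instead of raising an error
toy["date_added"] = pd.to_datetime(toy["date_added"], errors="coerce")

# Int64 (capital I) keeps whole-number years while still allowing missing values
toy["year_added"] = toy["date_added"].dt.year.astype("Int64")

print(toy)
```

Counting the NaT rows after conversion (for example with isna().sum()) is a simple way to see how many dates failed to parse.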
📌 3. Splitting duration
The duration column mixes numbers and units (e.g., "90 min", "2 Seasons"). This was split into:
duration_int → numerical value
duration_type → "min" or "Season(s)"
# Split duration
duration_split = df["duration"].str.split(" ", n=1, expand=True)
df["duration_int"] = pd.to_numeric(duration_split[0], errors="coerce")
df["duration_type"] = duration_split[1]
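Because minutes and seasons are different units, duration_int is only meaningful when summarized per unit. The sketch below repeats the split on invented values and adds a hypothetical unit column that folds "Season" and "Seasons" into one label; that column is an illustration and not part of the notebook.

```python
import pandas as pd

# Hypothetical durations covering both units and a missing value
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", None]})

# Same split as above: number before the first space, unit after it
parts = toy["duration"].str.split(" ", n=1, expand=True)
toy["duration_int"] = pd.to_numeric(parts[0], errors="coerce")
toy["duration_type"] = parts[1]

# Assumed extra step: drop a trailing "s" so "Seasons" and "Season" share a label
toy["unit"] = toy["duration_type"].str.replace(r"s$", "", regex=True)

# Summarize each unit separately rather than mixing minutes with seasons
summary = toy.groupby("unit")["duration_int"].median()

print(summary)
```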
📌 4. Extracting Primary Country
Some titles list multiple countries (e.g., "United States, Canada"). A simplified country_primary field was created using the first listed country.
# Extract primary country
def get_first_country(x):
    # Missing countries stay missing
    if pd.isna(x):
        return None
    return x.split(",")[0].strip()
df["country_primary"] = df["country"].apply(get_first_country)
📌 5. Extracting Main Genre
The listed_in column often contains multiple genres, such as: "International Movies, Dramas, Thrillers"
A new field main_genre was created by selecting the first listed genre.
# Extract main genre
def get_main_genre(x):
    # Missing genre lists stay missing
    if pd.isna(x):
        return None
    return x.split(",")[0].strip()
df["main_genre"] = df["listed_in"].apply(get_main_genre)
df.head()
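Both extraction steps follow the same pattern, so one possible alternative is a single vectorized helper. The first_item function below is a hypothetical name introduced for this sketch, and the sample rows are invented.

```python
import pandas as pd

def first_item(series: pd.Series) -> pd.Series:
    """Return the first comma-separated entry of each value; missing values stay missing."""
    return series.str.split(",").str[0].str.strip()

# Hypothetical rows with multi-value, single-value, and missing entries
toy = pd.DataFrame({
    "country": ["United States, Canada", "India", None],
    "listed_in": ["International Movies, Dramas, Thrillers", "Comedies", "Docuseries"],
})

toy["country_primary"] = first_item(toy["country"])
toy["main_genre"] = first_item(toy["listed_in"])

print(toy[["country_primary", "main_genre"]])
```

Keeping only the first entry is a simplification: a title listed under several countries or genres is counted once, under whichever appears first.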
In this section, we use Altair to build interactive and static visualizations that answer the core questions about Netflix’s catalog: growth over time, genre distribution, rating patterns, country contributions, and the delay between release and being added to Netflix.
Interactive exploration
Sliders and hover effects allow users to explore trends (such as content growth and country comparisons) without visual clutter.
Consistent, Netflix-inspired colors
Using red and dark tones creates a cohesive look that matches Netflix branding and improves readability across all charts.
Simplified noisy fields
Complex fields (ratings, multiple genres, multi-country listings) were cleaned and grouped to highlight clearer patterns in the dataset.
Normalized views for fair comparison
The stacked area chart uses proportional values to show how genre share changes over time, giving a more accurate picture than raw counts.
Multiple coordinated charts
Using bar charts, heatmaps, and interactive comparisons provides complementary perspectives and supports a deeper understanding of Netflix’s catalog.
Figure 4.1 — Growth of Netflix Titles Over Time (Interactive Slider)
This visualization shows how many titles in the catalog were released in each year from 2004 onward. An interactive slider sets the latest release year displayed.
import altair as alt
# Filter to modern years
df_modern = df[df["release_year"] >= 2004]
# Define slider: user can choose the *maximum* year to display
year_slider = alt.binding_range(min=2004, max=int(df_modern["release_year"].max()), step=1)
year_param = alt.param("Year", value=int(df_modern["release_year"].max()), bind=year_slider)
chart_year_slider = (
    alt.Chart(df_modern)
    .mark_line(point=True)
    .encode(
        x=alt.X(
            "release_year:O",
            axis=alt.Axis(title="Release Year", labelAngle=45)
        ),
        y=alt.Y(
            "count():Q",
            title="Number of Titles"
        ),
        color=alt.Color(
            "type:N",
            title="Content Type",
            scale=alt.Scale(
                # Netflix-inspired colors: red for Movies, dark gray for TV Shows
                domain=["Movie", "TV Show"],
                range=["#E50914", "#221F1F"]
            )
        ),
        tooltip=["type", "release_year", "count()"]
    )
    .add_params(year_param)
    .transform_filter("datum.release_year <= Year")
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Netflix Titles by Release Year (2004–Present) – Interactive"
    )
)
chart_year_slider
⚠️ Note: Years before 2004 contain very few titles, which compresses the chart.
Interpretation:
This chart shows a clear upward trend in the number of titles released from 2004 onward. Before the mid-2010s, Netflix’s catalog grows slowly. Starting around 2014–2016, the number of new titles increases much more rapidly, reflecting Netflix’s global expansion and the introduction of original content. Movies dominate the platform in earlier years, but TV Shows become increasingly common in later years. The overall pattern suggests strong growth in both content volume and diversity during the last decade.
The slider allows users to explore how the catalog looked at different points in time and observe how quickly content availability grows during the 2010s.
Figure 4.2 — Top 10 Most Common Netflix Genres
This chart shows the ten most common primary genres in the dataset. The main_genre
column represents the first genre listed for each title, giving a simplified but
consistent way to group categories. This visualization helps identify the genres
Netflix relies on the most in its catalog.
# Compute top 10 genres
genre_counts = df["main_genre"].value_counts().nlargest(10).reset_index()
genre_counts.columns = ["genre", "count"]
chart_genre_counts = (
    alt.Chart(genre_counts)
    .mark_bar(color="#E50914")  # Netflix red
    .encode(
        x="count:Q",
        y=alt.Y("genre:N", sort="-x"),
        tooltip=["genre", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Top 10 Netflix Genres"
    )
)
chart_genre_counts
Interpretation:
The chart shows that Dramas and Comedies appear most frequently on Netflix, followed by
Documentaries and International content. This highlights Netflix’s emphasis on broadly appealing
and globally relevant genres.
💡 Insight: Dramas and Comedies consistently dominate Netflix’s library.
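The top-N table behind this chart can also be built with explicit column names, which avoids relying on the default names that reset_index produces. This is a sketch on invented genre values, not a replacement for the notebook's code.

```python
import pandas as pd

# Hypothetical genre column with three distinct values
genres = pd.Series(
    ["Dramas", "Comedies", "Dramas", "Documentaries", "Dramas", "Comedies"],
    name="main_genre",
)

# Name the index and the count column explicitly while flattening to a DataFrame
top2 = (
    genres.value_counts()
    .nlargest(2)
    .rename_axis("genre")
    .reset_index(name="count")
)

print(top2)
```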
Figure 4.3 — Genre vs Rating Category Heatmap
This heatmap shows how Netflix genres relate to content ratings. Darker colors represent more titles in a specific genre–rating combination.
# Build the simplified rating group column
def simplify_rating(r):
    if r in ["TV-MA", "R", "NC-17"]:
        return "Mature"
    elif r in ["TV-14", "PG-13"]:
        return "Teen"
    else:
        # Every other label, including unrated ones such as "NR", lands here
        return "Family/Kids"
df_genre = df[df["main_genre"].notna() & df["rating"].notna()].copy()
df_genre["rating_group"] = df_genre["rating"].apply(simplify_rating)
# Use only Top 10 genres for heatmap
top_genres = genre_counts["genre"].tolist()
df_genre_top = df_genre[df_genre["main_genre"].isin(top_genres)]
heatmap_simple = (
    alt.Chart(df_genre_top)
    .mark_rect()
    .encode(
        x=alt.X("rating_group:N", title="Rating Category"),
        y=alt.Y("main_genre:N", title="Main Genre", sort=top_genres),
        color=alt.Color(
            "count():Q",
            title="Number of Titles",
            scale=alt.Scale(scheme="reds")
        ),
        tooltip=["main_genre", "rating_group", "count()"]
    )
    .properties(
        width=CHART_WIDTH * 0.5,
        height=CHART_HEIGHT,
        title="Genre vs. Rating Category"
    )
)
heatmap_simple
Interpretation:
This heatmap shows how different primary genres distribute across rating categories.
Documentaries, Dramas, and Comedies appear most frequently across all ratings, with a
particularly strong concentration in the Teen and Mature categories. Family/Kids ratings
are much less common overall, indicating that Netflix's catalog skews toward older audiences.
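The counts behind a heatmap like this can be inspected as a plain table. The sketch below uses a dictionary lookup with the same three groups and a cross-tabulation; RATING_GROUPS, rating_group, and the four sample rows are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# Same grouping as simplify_rating, expressed as a lookup table
RATING_GROUPS = {
    "TV-MA": "Mature", "R": "Mature", "NC-17": "Mature",
    "TV-14": "Teen", "PG-13": "Teen",
}

def rating_group(r):
    # Any label not listed above falls back to "Family/Kids"
    return RATING_GROUPS.get(r, "Family/Kids")

# Hypothetical titles
toy = pd.DataFrame({
    "main_genre": ["Dramas", "Dramas", "Comedies", "Kids' TV"],
    "rating": ["TV-MA", "PG-13", "TV-14", "TV-Y"],
})
toy["rating_group"] = toy["rating"].map(rating_group)

# Rows are genres, columns are rating groups, cells are title counts
table = pd.crosstab(toy["main_genre"], toy["rating_group"])

print(table)
```

Note that the fallback also catches labels that are not really family ratings (for example "NR"), so the Family/Kids column may be somewhat inflated if such labels are present.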
Figure 4.4 — Relative Genre Distribution Over Time (2004–Present)
This stacked area chart shows how the share of each top genre changes over time. Stacking is normalized to 100%, so we see relative composition rather than raw counts.
genre_year = (
    df[df["release_year"] >= 2004]
    .groupby(["release_year", "main_genre"])
    .size()
    .reset_index(name="count")
)
genre_year = genre_year[genre_year["main_genre"].isin(top_genres)]
chart_genre_over_time = (
    alt.Chart(genre_year)
    .mark_area()
    .encode(
        x=alt.X("release_year:O", title="Release Year"),
        y=alt.Y("count:Q", stack="normalize", title="Share of Titles"),
        color=alt.Color("main_genre:N", title="Main Genre"),
        tooltip=["release_year", "main_genre", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Relative Genre Share Over Time (Top 10 Genres)"
    )
)
chart_genre_over_time
Interpretation:
This stacked area chart shows how the proportional share of top genres has shifted over time
on Netflix. Dramas and Comedies maintain a consistently large share throughout the entire period,
reflecting their broad global demand. Documentaries and International TV Shows grow noticeably
after 2015, suggesting an increase in global content acquisition and niche audience expansion.
Because the chart is normalized, we see how genre balance changed—not just total volume—making
it clear that Netflix became more genre-diverse as the platform scaled.
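The normalization that stack="normalize" applies can be reproduced in pandas to read off exact shares. The counts below are invented. Because genre_year is filtered to the top genres before stacking, each share is relative to the top-10 total for that year, not to all titles.

```python
import pandas as pd

# Hypothetical per-year genre counts
toy_counts = pd.DataFrame({
    "release_year": [2015, 2015, 2016, 2016],
    "main_genre": ["Dramas", "Comedies", "Dramas", "Comedies"],
    "count": [30, 10, 45, 45],
})

# Divide each count by the total for its year so shares sum to 1 within a year
year_totals = toy_counts.groupby("release_year")["count"].transform("sum")
toy_counts["share"] = toy_counts["count"] / year_totals

print(toy_counts)
```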
Figure 4.5 — Countries Producing the Most Netflix Titles
This chart shows the countries with the highest number of titles. Only the primary
country listed for each title (country_primary) is used to avoid double-counting
multi-country entries. The chart highlights the geographic concentration of content
production on Netflix.
country_counts = (
    df["country_primary"].value_counts().nlargest(15).reset_index()
)
country_counts.columns = ["country", "count"]
chart_country_counts = (
    alt.Chart(country_counts)
    .mark_bar(color="#221F1F")
    .encode(
        x=alt.X("count:Q", title="Number of Titles"),
        y=alt.Y("country:N", sort="-x", title="Country"),
        tooltip=["country", "count"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Top 15 Countries Producing Netflix Titles"
    )
)
chart_country_counts
Interpretation:
The United States and India dominate Netflix's catalog, contributing far more titles than any
other country. The next highest contributors—such as the United Kingdom, Japan, Canada, and
South Korea—represent strong regional production hubs. The chart highlights how Netflix’s content
library is heavily influenced by Hollywood and Bollywood, with growing representation from
Asian and European markets.
Figure 4.6 — Interactive Country Comparison (Hover to Highlight)
This interactive chart provides a clearer comparison between countries by allowing users to hover over each bar and temporarily highlight it. This makes it easier to focus on individual countries and compare their contribution to Netflix’s catalog without visual clutter.
# empty=False keeps every bar dark gray until one is hovered
highlight = alt.selection_point(on='mouseover', fields=['country'], empty=False)
chart_country_highlight = (
    alt.Chart(country_counts)
    .mark_bar()
    .encode(
        x="count:Q",
        y=alt.Y("country:N", sort="-x"),
        color=alt.condition(
            highlight,
            alt.value("#E50914"),  # highlight in Netflix red
            alt.value("#221F1F")   # default dark gray
        ),
        tooltip=["country", "count"]
    )
    .add_params(highlight)
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Interactive Country Comparison"
    )
)
chart_country_highlight
Interpretation:
This interactive version allows users to hover over bars to highlight one country at a time,
making comparison easier than in the static chart. The United States and India remain the clear
leaders, but the interaction helps reveal subtler differences among mid-range contributors like
the UK, Japan, Canada, and South Korea. This chart is particularly useful when examining how
countries compare individually without visual clutter.
Figure 4.7 — Lag Between Release Year and Netflix Addition
This histogram shows how long it takes for titles to appear on Netflix after their original release.
# Compute the lag in years
lag_df = df[df["year_added"].notna()].copy()
lag_df["lag_years"] = lag_df["year_added"] - lag_df["release_year"]
lag_hist = (
    alt.Chart(lag_df)
    .mark_bar()
    .encode(
        x=alt.X(
            "lag_years:Q",
            bin=alt.Bin(step=1),
            title="Years Between Release and Added to Netflix"
        ),
        y=alt.Y("count():Q", title="Number of Titles"),
        color=alt.Color(
            "type:N",
            title="Type",
            scale=alt.Scale(
                domain=["Movie", "TV Show"],
                range=["#E50914", "#221F1F"]
            )
        ),
        tooltip=["lag_years", "count()"]
    )
    .properties(
        width=CHART_WIDTH,
        height=CHART_HEIGHT,
        title="Distribution of Delay Between Release and Being Added to Netflix"
    )
)
lag_hist
Interpretation:
The distribution of lag years shows that most titles are added to Netflix relatively soon after
their original release, with the highest concentration occurring between 0 and 5 years. The
frequency drops steadily as lag increases, suggesting that Netflix prioritizes acquiring content
that is recent or still culturally relevant. Very large lag values are uncommon, indicating that
older titles are added less frequently.
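A few summary numbers can back up what the histogram shows. The sketch below computes the lag on invented rows, flags negative lags (which occur if year_added is earlier than release_year), and measures the share of titles added within five years. The column names follow the notebook; the values are made up.

```python
import pandas as pd

# Hypothetical titles; the last one has no known year_added
toy = pd.DataFrame({
    "release_year": [2019, 2010, 2021, 2018],
    "year_added": pd.array([2019, 2020, 2020, None], dtype="Int64"),
})

# Keep rows with a known year_added, then compute the lag in years
lag = toy.dropna(subset=["year_added"]).copy()
lag["lag_years"] = lag["year_added"] - lag["release_year"]

# Negative lags are worth inspecting before interpreting the histogram
negative = lag[lag["lag_years"] < 0]

# Share of titles added within 0 to 5 years of release
share_within_5 = lag["lag_years"].between(0, 5).mean()

print(lag["lag_years"].tolist(), len(negative), round(share_within_5, 2))
```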
The goal of this evaluation was to determine whether the visualizations effectively supported the main analytical questions of the project—specifically, understanding trends in Netflix content growth, genre patterns, country contributions, and the lag between release year and Netflix addition.
Participants were recruited from classmates, friends, and coworkers who regularly use Netflix but do not specialize in data visualization. This group represents typical streaming users and is appropriate for evaluating whether the visualizations communicate insights clearly to a general audience.
Participants interacted with the visualization report and completed a small set of tasks, such as:
During the session, they provided feedback on clarity, usability, and how intuitive the interactive elements were.
The evaluation considered several measures:
Participants reported that the visualizations were clear, visually appealing, and easy to navigate. The interactive elements (slider and highlighting) were particularly effective in helping them explore trends without feeling overwhelmed. Some participants suggested adding short explanatory notes under each chart, which led to the final added interpretation text. Overall, the feedback confirmed that the visualizations successfully communicated the intended insights and were accessible to non-expert users.
This project analyzed a subset of Netflix titles to uncover patterns in content growth, genre distribution, ratings, geographic representation, and release timelines. Through a combination of data cleaning, feature engineering, and interactive visualization, several meaningful insights emerged.
Netflix’s catalog has grown steadily, with a sharp increase in both Movies and TV Shows after the mid-2010s. The interactive slider highlights how quickly the platform expanded its offerings year by year.
Genre analysis revealed that Dramas, Comedies, Documentaries, and International TV Shows are among the most dominant genres. Visualization of the top 10 genres shows a clear skew toward story-driven and international content.
By grouping content ratings into Family/Kids, Teen, and Mature categories, the heatmap becomes easier to interpret. Most mature-rated content appears in Dramas, Crime shows, and Action genres, while Kids’ content remains highly specialized. This grouping decision improved clarity and reduced visual clutter compared to showing every individual rating.
The country bar charts show that the United States and India contribute the largest volume of Netflix content, with notable contributions from the United Kingdom, Japan, Canada, and South Korea. An interactive highlight chart makes country comparison more intuitive.
The lag histogram reveals that most titles are added to Netflix within 0–10 years of their original release, with a noticeable concentration near low lag values. This pattern suggests that Netflix tends to acquire or produce content close to its release year, especially for TV shows.
Future iterations could:
Netflix’s catalog is diverse, global, and increasingly rapid in content acquisition.